Overview of Data Sets

We are analyzing the quality of wine based on 11 predictors: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol. Quality, in this instance, is defined as an indicator of its craftsmanship and, thus, desirability which can be used, for example, in pricing. Quality does not necessarily indicate if a wine has gone bad. We will use two data sets – one for white wine and the other for red – to create two respective machine learning models. There are 4899 observations in the white wine data set and 1600 observations in the red wine data set.


Overview of Research Questions

We are interested in conducting analysis on both the white and red wine to assess the quality of wine in new wines in production so that they may be accurately priced and marketed.


Loading Data

This project uses data on the white and red wine, which records information of the chemical makeup of the wine.

white_og <- read.csv("Wine Dataset/winequality-white.csv", sep = ";")
red_og <- read.csv("Wine Dataset/winequality-red.csv", sep = ";")


Data Cleaning

To clean our data, we clean the column names, change quality into a factor so we can analyze it with classification models, and add a type of White or Red to each data set. We removed any rows in the white wine data set with a quality value of 9 because there are too few instances, and thus, it inhibits our models from performing correctly later on. The red wine data set did not have any quality values of 9. We also create a combined data frame with both values from white and red wine to see if there are significant differences between red and white evaluations for quality.

Data Split

In our data split, we put a proportion of .7 of each original data set into a training data set and a proportion of .3 into the testing data sets, stratifying by quality. In this section, we also folded our data into 5 folds for later cross validation use.

set.seed(1234)
white_split <- white %>% 
  initial_split(prop = 0.7, strata = "quality")

white_train <- training(white_split)
white_test <- testing(white_split)

red_split <- red %>% 
  initial_split(prop = 0.7, strata = "quality")

red_train <- training(red_split)
red_test <- testing(red_split)

white_fold <- vfold_cv(white_train, v = 5)
red_fold <- vfold_cv(red_train, v = 5)

Exploratory Data Analysis

What sort of factors do winemakers and sommeliers look for in a quality wine? Generally, quality is determined by acidity, dryness, flavor profile or taste, alcohol content, and how well the wine is preserved or how it changes as it is stored. In our exploratory data analysis, we will analyze our predictors based on these five categories. First, acidity levels can be summarized through the ph levels, fixed.acidity, volatile.acidity, and citric.acid content. Dryness is determined by the density. Taste can be broken down into sweetness and saltiness, which are caused by residual.sugar and chlorides respectively. We will analyze alcohol content singularly to see its effect on the wine quality. Lastly, sulfurous compounds are what is generally used to preserve wine, so we will analyze free.sulfur.dioxide, total.sulfur.dioxide, and sulphates to see if the way a wine is preserved interacts with wine quality in an interesting way.

Our data can be split into two data sets because experts look for different levels of acidity, sugar, etc. for white wine and red wine. Thus, we will have 3 different representations of the data: one for white wine, one for red wine, and one for both.

All of our predictors are continuous, so we will use box plots, histograms, and scatter plots to visualize our data and determine feature selection.

First, let’s see the distribution of quality between both data sets of wine.

ggplot(combinedWine_og, aes(quality)) + geom_bar(color = "black", fill = "pink") + labs(title = "Histogram of Quality - Total Wine") + xlab("Quality of Wine") + ylab("Count") 

We can see that it is normally distributed, meaning that most wine has a quality value of 5 or 6, with few exceptionally good wines having a value of 8 or 9, and low quality wines having a quality value of 3. Based on their low frequency, we can further justify selecting against of quality values of 9 in our initial data cleaning.

Next, we look at the correlation matrices for white and red wine separately to determine which predictors are correlated.

Correlation Matrices

White

white %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = 'lower', diag = FALSE, 
           method = 'color', main = 'White Wine Correlation Plot')

Red

red %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = 'lower', diag = FALSE, 
           method = 'color', main = 'Red Wine Correlation Plot')

In the white wine correlation matrix, density and residual sugar; and density and alcohol are the predictors with the highest correlation. Total sulfur dioxide and free sulfur dioxide also have a moderate correlation.

In the red wine correlation matrix, citric acid and fixed acidity; density and fixed acidity; citric acid and volatile acidity; pH and fixed acidity; and free sulfur dioxide and total sulfur dioxide are highly correlated with each other.

Scatter plots

To visualize and validate these correlations, let’s take a look at the scaled scatter plot of each predictor plotted against its correlated counterpart.

White

# scaled data sets 
scaled_white = as.data.frame(scale(select(white, c(-quality,-type))))
scaled_red = as.data.frame(scale(select(red, c(-quality,-type))))

# scaled white residual sugar versus density 
ggplot(scaled_white, aes(x = residual.sugar, y = density)) + geom_point()+scale_x_continuous(name = "Residual Sugar") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle(" Residual Sugar Versus Density") + theme(plot.title = element_text(size = 20))

# scaled white alcohol versus density 
ggplot(scaled_white, aes(x = alcohol, y = density)) + geom_point()+scale_x_continuous(name = "Alcohol") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Alcohol Versus Density") + theme(plot.title = element_text(size = 20))

# scaled white free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_white, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 20))

Red

# scaled red volatile acidity versus citric acid
ggplot(scaled_red, aes(x = volatile.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Volatile Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Volatile Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 20))

# scaled red fixed acidity versus citric acid 
ggplot(scaled_red, aes(x = fixed.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 20))

# scaled red fixed acidity versus pH
ggplot(scaled_red, aes(x = fixed.acidity, y = pH)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "pH") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus pH") + theme(plot.title = element_text(size = 20))

# scaled red fixed acidity versus density
ggplot(scaled_red, aes(x = fixed.acidity, y = density)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Density") + theme(plot.title = element_text(size = 20))

# scaled red free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_red, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Dioxide Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 20))

Based on the scatter plots, we can visualize the correlations between the predictors. For example, for white wine, density has a strong positive correlation with residual sugar and a moderate negative correlation with alcohol. Through these scatter plots, we confirm the existence of correlations predicted by our initial correlation matrix.

Now, we can take a look at the box plots for several of our predictors to see the ways that they interact with wine quality, isolated from the other predictors. First, we will visualize acidity levels which can be measured through fixed acidity, volatile acidity, and citric acid levels. As shown above in the scatter plots, these three predictors are highly correlated with each other in red wine.

#fixed acidity
ggplot(combinedWine, mapping = aes(x = `fixed.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Fixed Acidity Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) +coord_cartesian( xlim = c(0,16), ylim = NULL, default = FALSE )

# volatile acidity 
ggplot(combinedWine, mapping = aes(x = `volatile.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Volatile Acidity Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )

# citric acid 
ggplot(combinedWine, mapping = aes(x = `citric.acid`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Citric Acid Levels Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )

From these box plots, we can see that fixed acidity levels are relatively consistent in both red and white wine. Volatile acidity has a negative correlation with quality in red wine, but relatively consistent averages for each level of wine quality in white wine. Citric acid levels in red wine have a stronger positive correlation than in white wine. In general, we can see that acidity levels fluctuate more in red wine than in white wine.

Next, let’s take a look at dryness which is determined by the predictor density. Based on the correlation matrix and scatter plots, density also is correlated with residual sugar and alcohol in white wine and with fixed acidity in red wine.

# density
ggplot(combinedWine, mapping = aes(x = density, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = density, y = quality)) + labs(title = "Red and White Density Levels versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(.985,1.01), ylim = NULL, default = FALSE )

Although density is correlated with several predictors according to the correlation matrices and scatter plots, in this box plot, we can see that density stays relatively consistent, around 1, for each level of wine quality.

Next, we will look at the taste of the wine, which is determined by levels of sweetness and saltiness. These are affected by sugar levels and chlorides respectively.

#sugar content
ggplot(combinedWine, mapping = aes(x = residual.sugar, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = residual.sugar, y = quality)) + labs(title = "Red and White Residual Sugar Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,25), ylim = NULL, default = FALSE )

# we removed outliers to ensure that the variation was not due to the outliers 
ggplot(red[red$free.sulfur.dioxide < 50,], aes(x = free.sulfur.dioxide, y = quality)) + 
  geom_boxplot(aes(fill = quality)) +
  labs(title = "Free Sulfur Dioxide for Red Wine", x = "Free Sulfur Dioxide", y = "Quality") +
  geom_point(width = 0.15) +
  scale_fill_brewer(palette = "RdPu")

#chlorides
ggplot(combinedWine, mapping = aes(x = chlorides, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = chlorides, y = quality)) + labs(title = "Red and White Chloride Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,0.3), ylim = NULL, default = FALSE )

White wine has, on average, higher and more variable sugar levels than red wine while red wine has an on average higher chloride content than white wine. There seem to be a higher number of outliers in the values of chloride.

Next, we will analyze alcohol content, which can affect the taste of the wine as well.

ggplot(combinedWine, mapping = aes(x = alcohol, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Alcohol Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) 

There is generally higher alcohol content associated with a wine of higher quality but there is not a significant different in averages between red wine and white wine.

Lastly, let’s look at the preservative content, which is determined by free sulfur dioxide, total sulfur dioxide, and sulfates.

#free sulfur dioxide
ggplot(combinedWine, mapping = aes(x = free.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = free.sulfur.dioxide, y = quality)) + labs(title = "Red and White Free Sulfur Dioxide Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,150), ylim = NULL, default = FALSE )

#total sulfur dioxide
ggplot(combinedWine, mapping = aes(x = total.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = total.sulfur.dioxide, y = quality)) + labs(title = "Red and White Total Sulfur Dioxide Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,300), ylim = NULL, default = FALSE )

#sulfates
ggplot(combinedWine, mapping = aes(x = sulphates, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = sulphates, y = quality)) + labs(title = "Red and White Sulfate Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1.5), ylim = NULL, default = FALSE )

From the box plots we can see that red wine generally has a lower sulfur dioxide content than white wine. Also, averages across each stratification of quality have similar values except for sulfates in red wine, which have a slight positive correlation with quality.

CONCLUDING SENTENCE; CONNECT TO HOW KNOWING THESE THINGS AFFECTS OUR MODELLING


Loading in Saved Models

We did all of our modeling in R-Scripts for efficiency purposes, since models generally take a long time to run. We will load all of results here, and intermittently call variables throughout the report to visualize our calculations.

# LDA 
load("WhiteLDA.rda")
load("RedLDA.rda")

# LDA with PCA
load("WhiteLDAPCA.rda")
load("RedLDAPCA.rda")

# Decision Tree
load("WhiteWineDecisionTree.rda")
load("RedWineDecisionTree.rda")

# Random Forest 
load("WhiteWineRandomForest.rda")
load("RedWineRandomForest.rda")

# Boosted Trees
load("WhiteWineBoostedTrees.rda")
load("RedWineBoostedTrees.rda")

Model Fitting for White Wine

We will be fitting linear discriminant analysis, naive Bayes, single decision tree, random forest, and boosted tree models and compare accuracy metrics. Then, we will fit the three models with the best roc_auc to our testing data. First, let’s see how the models perform on the white wine data set.

Recipe

white_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())

K-Fold Cross Validation

Let’s first explore linear discriminant analysis and naive Bayes classification through k-fold cross validation.

Linear Discriminant Analysis

For the linear discriminant analysis model, we use a classification mode and set the engine to MASS. We then add the model and recipe to a workflow and create a fit between the workflow and folded data. We are using roc_auc to evaluate accuracy.

#set up model with mode classification and engine MASS
wlda_model <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS") 

#add model and recipe to the workflow
wlda_wkflow<- workflow() %>% 
  add_model(wlda_model) %>% 
  add_recipe(white_recipe)

#create a fit between the workflow and folded data
wlda_fit_cross <- fit_resamples(wlda_wkflow, white_fold)

#determine the roc_auc of the LDA model on the folded training data
collect_metrics(wlda_fit_cross)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.528     5 0.00640 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.735     5 0.0159  Preprocessor1_Model1

Now, let’s take a look at how our cross validation method works with a Naive Bayes model. In particular, let’s see if the accuracy increases.

Naive Bayes

#set up model with mode classification and engine kLaR
#we used set_args(use_kernel = FALSE) based on Lab 3
wnb_mod <- naive_Bayes() %>% 
  set_mode("classification") %>% 
  set_engine("klaR") %>% 
  set_args(usekernel = FALSE) 

#add model and recipe to the workflow
wnb_wkflow <- workflow() %>% 
  add_model(wnb_mod) %>% 
  add_recipe(white_recipe)

#create a fit between the workflow and folded data
wnb_fit_cross <- fit_resamples(wnb_wkflow, white_fold)

#determine the roc_auc of the Naive Bayes model on the folded training data
collect_metrics(wnb_fit_cross)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.451     5 0.00984 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.734     5 0.0200  Preprocessor1_Model1

Through k-fold cross validation, we can see that the linear discriminant analysis model produces better accuracy than the Naive Bayes model.

Although, to account for the collinearity between some of our predictors, which we saw in the exploratory data analysis, we will conduct principal component analysis on the data. Since our linear discriminant analysis model was better on the white wine data set, we will use the principal components in an LDA model.

Principle Component Analysis

To conduct principal component analysis, we will begin by setting up a recipe specifically for this purpose. We can now conduct an LDA workflow and model fit. We are tuning the model to find the best number of principal components using k-fold cross validation.

white_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())

# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(white_recipe_pca$template))

# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)

wlda_pca_wkflow <- workflow() %>% 
  add_model(wlda_model) %>% 
  add_recipe(white_recipe_pca)

pca_lda_fit <- wlda_pca_wkflow %>%
  tune_grid(resamples = white_fold,
            grid = tuneGrid,
            metrics = metric_set(accuracy),
            control = trControl)


This visualization represents the number of principal components versus the accuracy of the model. We can observe an obvious spike in accuracy at 9 principal components.

pca_lda_metrics <- pca_lda_fit %>% collect_metrics()

ggplot(pca_lda_metrics, aes(x = num_comp, y = mean)) +
  geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
  scale_x_continuous(breaks = 1:ncol(white_recipe_pca$template)) +
  facet_wrap(~.metric) +
  theme_bw()

pca_lda_fit %>% show_best(metric = "accuracy")

(bestTune <- pca_lda_fit %>% select_by_one_std_err(num_comp, metric = "accuracy"))

wlda_pca_wkflow_final <- wlda_pca_wkflow %>% finalize_workflow(bestTune)

fit_final <- wlda_pca_wkflow_final %>% fit(white_train)

white.PCALDA <- tibble(white_train,
       predict(fit_final, new_data = white_train, type = "class"), # predicted class
       predict(fit_final, new_data = white_train, type = "prob"), # posterior prob. for classes
       as_tibble(predict(fit_final, new_data = white_train, type = "raw")$x)) # LD scores


This is a visualization of the actually quality and the predicted qualities. We are only displaying about half of the data, so the plot is more interpretable. The plot displays the clustering of the data very well for each quality level. In addition, we can visually see how well the model predicts the qualities accurately, and around how often/how greatly the model fails.

# plot
ggplot(white.PCALDA[1:1500,], aes(x = LD1, y = LD2)) +
  geom_point(aes(color = quality, shape = .pred_class)) + 
  theme_bw() +
  ggtitle("PCA-LDA (DAPC) on White Wine Training dataset, using 9 PC")

#augmented on training 
pcalda_fit <- augment(fit_final, new_data = white_train) 
pcalda_acc <- pcalda_fit %>% accuracy(truth = quality, estimate = .pred_class)
pcalda_rocauc <- pcalda_fit %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA ")
pcalda_roccurve <- pcalda_fit %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
pcalda_confusionmatrix <- pcalda_fit %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")


These metrics and graphs reveal how accurately the PCA LDA works on the training data. The ROC-AUC curves are the best for qualities 4 and 8, although overall PCA LDA is not the most effective as the accuracy is merely 53%. Through the confusion matrix, we can see the model predicted 6s well.

pcalda_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.538
pcalda_rocauc
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type                       
##   <chr>   <chr>          <dbl> <chr>                            
## 1 roc_auc hand_till      0.760 "White Wine LDA Model using PCA "
pcalda_roccurve

pcalda_confusionmatrix

Single Decision Tree

Now, let’s try several tree methods to see if the produce more accurate results on the training data set of white wine. We will first look at the model for a single decision tree.

First, we set up a specification with the engine rpart and for classification.

# decision tree specification
wtree_spec <- decision_tree() %>%
  set_engine("rpart")

wtree_spec_class <- wtree_spec %>%
  set_mode("classification")


Next, we fit the specification to the training data.

wclass_tree_fit <- wtree_spec_class %>%
  fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)


Here is a visual of how the decision tree model works with our data.

wclass_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

Now, we augment the model on the training model and evaluate the accuracy and confusion matrix. The accuracy is 52% and the confusion matrix shows us that 5 and 6 quality are evaluated the best.

# augmented on training 
augment(wclass_tree_fit, new_data = white_train) %>%
  accuracy(truth = quality, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.529
augment(wclass_tree_fit, new_data = white_train) %>%
  conf_mat(truth = quality, estimate = .pred_class)%>% autoplot(type = "heatmap")

Here, we are tuning our model to determine the best measures for cost_complexity.

# tuning cost complexity 
wclass_tree_wf<- workflow() %>%
  add_model(wtree_spec_class %>% 
              set_args(cost_complexity = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)

tune_res_white <- tune_grid(
  wclass_tree_wf,
  resamples = white_fold,
  grid = param_grid,
  metric = metric_set(accuracy)
)

Now we produce a graph of accuracies and roc auc levels for various cost complexity parameters.

autoplot(tune_res_white)
wAutoPlot

# extracting the best cost complexity parameter
best_complexity <- select_best(tune_res_white)
wclass_tree_final <- finalize_workflow(wclass_tree_wf, best_complexity)
wclass_tree_final_fit <- fit(wclass_tree_final, data = white_train)

Using the measure of cost complexity which produces the best accuracy and roc auc levels, this is a visualization of the decision tree that is used in the model. It is very precise and may show signs that it would overfit on testing data.

wclass_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

# augmented on training 
wdectree_pred <- augment(wclass_tree_final_fit, new_data = white_train) 
wdectree_acc <- wdectree_pred %>% accuracy(truth = quality, estimate = .pred_class)
wdectree_rocauc <- wdectree_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Decision Tree Model")
wdectree_roccurve <- wdectree_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wdectree_confusionmatrix <- augment(wclass_tree_final_fit, new_data = white_train) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

This is the overall accuracy measures, roc auc measures, roc curves and confusion matrix that is returned my our decision tree model on the white wine data set. The accuracy is on par with the PCA LDA that we previously conducted.

print(wdectree_acc)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.647
print(wdectree_rocauc)
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.818 White Decision Tree Model
wdectree_roccurve

wdectree_confusionmatrix

Random Forest

In hopes of improving our results, we will now look at the random forest model.

# setting random forest model up
wrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
wrandfor_wf <- workflow() %>%
  add_model(wrandfor %>%
              set_args(mtry = tune(), trees = tune(), min_n = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
# tuning the model to find the best arguments 
param_grid1 <- grid_regular(mtry(range = c(1,9)), trees(range = c(15,17)), min_n(range = c(30,50)), levels = 8)

wtune_res_randfor <- tune_grid(
  wrandfor_wf,
  resamples = white_fold,
  grid = param_grid1,
  metric = metric_set(accuracy)
)

This visualization shows the accuracy and roc auc of the various values we are tuning our model with. For the final recipe, we will extract the metrics that return the best accuracy and roc auc.

autoplot(wtune_res_randfor)

# collecting metrics to find best mean
wbest_rocauc1 <- collect_metrics(wtune_res_randfor) %>% arrange(desc(mean))
wbest_metric1 <- select_best(wtune_res_randfor)

wrandfor_final <- rand_forest(mtry = 2, trees = 17, min_n = 30) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
wrandfor_fit_final <- fit(wrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)
# variance importance plot
wVIP <- vip(wrandfor_fit_final)

# extracting the metrics 
wrandfor_pred <- augment(wrandfor_fit_final, new_data = white_train) 
wrandfor_acc <- wrandfor_pred %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
wrandfor_rocauc <- wrandfor_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
wrandfor_roccurve <- wrandfor_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wrandfor_confusionmatrix <- augment(wrandfor_fit_final, new_data = white_train) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
wVIP

print(wrandfor_acc)
## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass     0.800 White Random Forest Model
print(wrandfor_rocauc)
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.968 White Random Forest Model
wrandfor_roccurve

wrandfor_confusionmatrix


Overall the random forest for the white wine data set has a drastic improvement in terms of the accuracy and roc auc. The variance importance plot reveals that alcohol, volatile acidity and density are most important variables in the data. Every single ROC AUC curve is almost perfect, although as we have learned through the course, this model may be over fitting, and as a result, may not perform as well on the testing data set. This model did the best for predicting wine qualities of 5,6, and 7.

Boosted Trees

Finally, let’s look at the boosted tree model.

wboost_spec <-  boost_tree(tree_depth = 5) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")
wboost_wf <- workflow() %>%
  add_model(wboost_spec %>%
              set_args(trees = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
param_grid3 <- grid_regular(trees(range = c(10,2000)), levels = 10)

wtune_res_boosted <- tune_grid(
  wboost_wf,
  resamples = white_fold,
  grid = param_grid3
)


This plot below visualizes the accuracy and roc auc levels for the number of trees. Since we tune the value for the number of trees, we can choose to use the number of trees which has the highest accuracy and roc auc levels.

wBoostedAutoPlot

wbest_rocauc2 <- collect_metrics(wtune_res_boosted) %>% arrange(desc(mean))
wbest_metric2 <- select_best(wtune_res_boosted)
print(wbest_metric2)
wboost_final <- boost_tree(tree_depth = 5, trees = 231)%>% 
  set_engine("xgboost") %>% 
  set_mode("classification")
wboost_fit_final <- fit(wboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)
# augmenting 
wpredicted <- augment(wboost_fit_final, new_data = white_train) 
wboosted_acc <- wpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
wboosted_rocauc <-  wpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
wBoostedROCCurve <- wpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wBoostedConfusionMatrix <- wpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
wboosted_acc
## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass         1 White Boosted Trees Model
wboosted_rocauc
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till          1 White Boosted Trees Model
wBoostedROCCurve

wBoostedConfusionMatrix


The accuracy and roc auc for the boosted trees model on the white wine dataset is 1, meaning every single observation was correctly classified in the model. This model has the potential to run amazingly on the testing data, although there is undeniable evidence of overfitting.

Determining Best Fit

wbest_roc_table <- rbind(wbest_rocauc[1,c(2,4)], wbest_rocauc1[1,c(4,6)], wbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
wbest_roc_table
## # A tibble: 3 × 3
##   .metric  mean model_type   
##   <chr>   <dbl> <chr>        
## 1 roc_auc 0.688 Decision Tree
## 2 roc_auc 0.794 Random Forest
## 3 roc_auc 0.805 Boosted Trees
# We will test our model using Boosted Trees and Random Forest 

As we can see, between the three tree based methods we ran, the two with the highest roc auc were Boosted Trees and Random Forest. LDA using PCA had a higher accuracy than Decision Trees so we will use that as our third model.

To conclude, we will test our data on three models: LDA using PCA, Random Forest and Boosted Trees.

Predicting on the Testing Data


Testing the PCA LDA model on the data:


pcalda_fit_test <- augment(fit_final, new_data = white_test) 
pcalda_test_acc <- pcalda_fit_test %>% accuracy(truth = quality, estimate = .pred_class)
pcalda_test_rocauc <- pcalda_fit_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA")
pcalda_test_roccurve <- pcalda_fit_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
pcalda_test_confusionmatrix <- pcalda_fit_test %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
pcalda_test_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.514
pcalda_test_rocauc
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type                    
##   <chr>   <chr>          <dbl> <chr>                         
## 1 roc_auc hand_till      0.685 White Wine LDA Model using PCA
pcalda_test_roccurve

pcalda_test_confusionmatrix


Testing the Random Forest model on the data:


wrandfor_pred_test <- augment(wrandfor_fit_final, new_data = white_test) 
wrandfor_acc_test <- wrandfor_pred_test %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
wrandfor_rocauc_test <- wrandfor_pred_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
wrandfor_roccurve_test <- wrandfor_pred_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wrandfor_confusionmatrix_test <- augment(wrandfor_fit_final, new_data = white_test) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
wrandfor_acc_test
## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass     0.608 White Random Forest Model
wrandfor_rocauc_test
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.764 White Random Forest Model
wrandfor_roccurve_test

wrandfor_confusionmatrix_test


Testing the Boosted Trees model on the data:


wpredictedtest <- augment(wboost_fit_final, new_data = white_test) 
wboosted_acc_test <- wpredictedtest %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
wboosted_rocauc_test <- wpredictedtest %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
wBoostedROCCurveTesting <- wpredictedtest %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wBoostedConfusionMatrixTesting <- wpredictedtest %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
wboosted_acc_test
## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass     0.657 White Boosted Trees Model
wboosted_rocauc_test
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.821 White Boosted Trees Model
wBoostedROCCurveTesting

wBoostedConfusionMatrixTesting

Conclusion for White Wine Data Set

In conclusion, the Boosted Trees model did the best on the testing data for the White Wine data set with an accuracy of 65.7% and a roc auc level of 0.82. The Random Forest model is a close second with accuracy 60.8% and a roc auc value of 0.76. LDA did not work very well, and using principal component analysis did not have much of an improvement. We thought that it would due to the fact that many of our predictors had high correlation. Majority of our models do well in predicting values of 5 and 6. Although, in general, we have seen that all of the models thus far do not do as well with 3,4, and 8. This is probably due to the fact that there are not as many observations with this quality level.

Model Fitting for Red Wine

Next, let’s see how the models perform on the red wine data set. We will be using the same type of models that we used in white wine in order to keep things consistent.

Recipe

red_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())

K-Fold Cross Validation

Linear Discriminant Analysis

#lda model using cross validation
rlda_model <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS") 

rlda_wkflow<- workflow() %>% 
  add_model(rlda_model) %>% 
  add_recipe(red_recipe)

rlda_fit_cross <- fit_resamples(rlda_wkflow, red_fold)

collect_metrics(rlda_fit_cross)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.601     5 0.0127  Preprocessor1_Model1
## 2 roc_auc  hand_till  0.786     5 0.00468 Preprocessor1_Model1

Naive Bayes

#naive bayes model using cross validation
rnb_mod <- naive_Bayes() %>% 
  set_mode("classification") %>% 
  set_engine("klaR") %>% 
  set_args(usekernel = FALSE) 

rnb_wkflow <- workflow() %>% 
  add_model(rnb_mod) %>% 
  add_recipe(red_recipe)

rnb_fit_cross <- fit_resamples(rnb_wkflow, red_fold)

collect_metrics(rnb_fit_cross)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.534     5 0.00828 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.743     5 0.00714 Preprocessor1_Model1

Through k-fold cross validation, we can see that the linear discriminant analysis model produces a more accurate model than the Naive Bayes model. Thus, we will also use linear discriminant analysis on the red wine data set.

Principle Component Analysis

red_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())
# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(red_recipe_pca$template))
# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)
rlda_pca_wkflow <- workflow() %>% 
  add_model(rlda_model) %>% 
  add_recipe(red_recipe_pca)

rpca_lda_fit <- rlda_pca_wkflow %>%
  tune_grid(resamples = red_fold,
            grid = tuneGrid,
            metrics = metric_set(accuracy),
            control = trControl)

rpca_lda_metrics <- rpca_lda_fit %>% collect_metrics()
ggplot(rpca_lda_metrics, aes(x = num_comp, y = mean)) +
  geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
  scale_x_continuous(breaks = 1:ncol(red_recipe_pca$template)) +
  facet_wrap(~.metric) +
  theme_bw()

rpca_lda_fit %>% show_best(metric = "accuracy")

(bestTune <- rpca_lda_fit %>% select_by_one_std_err(num_comp, metric = "accuracy"))

rlda_pca_wkflow_final <- rlda_pca_wkflow %>% finalize_workflow(bestTune)

rfit_final <- rlda_pca_wkflow_final %>% fit(red_train)
red.PCALDA <- tibble(red_train,
       predict(rfit_final, new_data =red_train, type = "class"), # predicted class
       predict(rfit_final, new_data = red_train, type = "prob"), # posterior prob. for classes
       as_tibble(predict(rfit_final, new_data = red_train, type = "raw")$x)) # LD scores
# plot
ggplot(red.PCALDA, aes(x = LD1, y = LD2)) +
  geom_point(aes(color = quality, shape = .pred_class)) + 
  theme_bw() +
  ggtitle("PCA-LDA (DAPC) on Red Wine Training dataset, using 9 PC")

rpcalda_fit <- augment(rfit_final, new_data = red_train) 
rpcalda_acc <- rpcalda_fit %>% accuracy(truth = quality, estimate = .pred_class)
rpcalda_rocauc <- rpcalda_fit %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA ")
rpcalda_roccurve <- rpcalda_fit %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rpcalda_confusionmatrix <- rpcalda_fit %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
rpcalda_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.618
rpcalda_rocauc
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type                       
##   <chr>   <chr>          <dbl> <chr>                            
## 1 roc_auc hand_till      0.826 "White Wine LDA Model using PCA "
rpcalda_roccurve

rpcalda_confusionmatrix

Single Decision Tree

# decision tree specification
rtree_spec <- decision_tree() %>%
  set_engine("rpart")

# setting mode to classification
rtree_spec_class <- rtree_spec %>%
  set_mode("classification")
rclass_tree_fit <- rtree_spec_class %>%
  fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)
rclass_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
# augmented on training 
augment(rclass_tree_fit, new_data = red_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

augment(rclass_tree_fit, new_data = red_train) %>%
  conf_mat(truth = quality, estimate = .pred_class)
# tuning cost complexity 
rclass_tree_wf<- workflow() %>%
  add_model(rtree_spec_class %>% 
              set_args(cost_complexity = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)

tune_res_red <- tune_grid(
  rclass_tree_wf,
  resamples = red_fold,
  grid = param_grid,
  metric = metric_set(accuracy)
)
autoplot(tune_res_red)

# extracting the best cost complexity parameter
best_complexity <- select_best(tune_res_red)

rclass_tree_final <- finalize_workflow(rclass_tree_wf, best_complexity)

rclass_tree_final_fit <- fit(rclass_tree_final, data = red_train)
rclass_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

# augmented on training 
augment(rclass_tree_final_fit, new_data = red_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

augment(rclass_tree_final_fit, new_data = red_train) %>%
  conf_mat(truth = quality, estimate = .pred_class)
# augmented on testing 
augment(rclass_tree_final_fit, new_data = red_test) %>%
  conf_mat(truth = quality, estimate = .pred_class) 

augment(rclass_tree_final_fit, new_data = red_test) %>%
  accuracy(truth = quality, estimate = .pred_class)

Random Forest

rrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
rrandfor_wf <- workflow() %>%
  add_model(rrandfor %>%
              set_args(mtry = tune(), trees = tune(), min_n = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
# tuning the model to find the best arguments 
param_grid2 <- grid_regular(mtry(range = c(1,9)), trees(range = c(15,17)), min_n(range = c(30,50)), levels = 8)

rtune_res_randfor <- tune_grid(
  rrandfor_wf,
  resamples = red_fold,
  grid = param_grid2
)
rAutoPlotRF

# collecting metrics to find best mean
rbest_rocauc1 <- collect_metrics(rtune_res_randfor) %>% arrange(desc(mean))
print(rbest_rocauc1)
rbest_metric1 <- select_best(rtune_res_randfor)

rrandfor_final <- rand_forest(mtry = 7, trees = 17, min_n = 32) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
rrandfor_fit_final <- fit(rrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol,data = red_train)
rVIP <- vip(rrandfor_fit_final)
rVIP

Boosted Tree

rboost_spec <-  boost_tree(tree_depth = 5) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")
rboost_wf <- workflow() %>%
  add_model(rboost_spec %>%
              set_args(trees = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
param_grid4 <- grid_regular(trees(range = c(10,2000)), levels = 10)

rtune_res_boosted <- tune_grid(
  rboost_wf,
  resamples = red_fold,
  grid = param_grid4
)
rBoostedAutoPlot

rbest_rocauc2 <- collect_metrics(rtune_res_boosted) %>% arrange(desc(mean))
print(rbest_rocauc2)

rbest_metric2 <- select_best(rtune_res_boosted)
print(rbest_metric2)

rboost_final <- boost_tree(tree_depth = 5, trees = 231)%>% 
  set_engine("xgboost") %>% 
  set_mode("classification")
rboost_fit_final <- fit(rboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)

rpredicted <- augment(rboost_fit_final, new_data = red_train) 
rboosted_acc <- rpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc <-  rpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurve <- rpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrix <- rpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

Determining Best Fit

rbest_rocauc
rbest_rocauc1
rbest_rocauc2
rbest_roc_table <- rbind(rbest_rocauc[1,c(2,4)], rbest_rocauc1[1,c(4,6)], rbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
rbest_roc_table

Predicting on the Testing Data

rpcalda_fit_test <- augment(rfit_final, new_data = red_test) 
rpcalda_test_acc <- rpcalda_fit_test %>% accuracy(truth = quality, estimate = .pred_class)
rpcalda_test_rocauc <- rpcalda_fit_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "Red Wine LDA Model using PCA")
rpcalda_test_roccurve <- rpcalda_fit_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rpcalda_test_confusionmatrix <- rpcalda_fit_test %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
rpcalda_test_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.561
rpcalda_test_rocauc
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type                  
##   <chr>   <chr>          <dbl> <chr>                       
## 1 roc_auc hand_till      0.778 Red Wine LDA Model using PCA
rpcalda_test_roccurve

rpcalda_test_confusionmatrix

rrandfor_pred_test <- augment(wrandfor_fit_final, new_data = white_test) 
rrandfor_acc_test <- wrandfor_pred_test %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
rrandfor_rocauc_test <- wrandfor_pred_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
rrandfor_roccurve_test <- wrandfor_pred_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rrandfor_confusionmatrix_test <- augment(wrandfor_fit_final, new_data = white_test) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
rrandfor_pred_test
## # A tibble: 481 × 20
##    fixed…¹ volat…² citri…³ resid…⁴ chlor…⁵ free.…⁶ total…⁷ density    pH sulph…⁸
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>
##  1     7.8   0.76     0.04     2.3   0.092      15      54   0.997  3.26    0.65
##  2     5.6   0.615    0        1.6   0.089      16      59   0.994  3.58    0.52
##  3     8.9   0.62     0.19     3.9   0.17       51     148   0.999  3.17    0.93
##  4     8.1   0.56     0.28     1.7   0.368      16      56   0.997  3.11    1.28
##  5     7.9   0.32     0.51     1.8   0.341      17      56   0.997  3.04    1.08
##  6     8.9   0.22     0.48     1.8   0.077      29      60   0.997  3.39    0.53
##  7     7.6   0.39     0.31     2.3   0.082      23      71   0.998  3.52    0.65
##  8     8.5   0.49     0.11     2.3   0.084       9      67   0.997  3.17    0.53
##  9     7.6   0.41     0.24     1.8   0.08        4      11   0.996  3.28    0.59
## 10     6.9   0.685    0        2.5   0.105      22      37   0.997  3.46    0.57
## # … with 471 more rows, 10 more variables: alcohol <dbl>, quality <fct>,
## #   type <chr>, .pred_class <fct>, .pred_3 <dbl>, .pred_4 <dbl>, .pred_5 <dbl>,
## #   .pred_6 <dbl>, .pred_7 <dbl>, .pred_8 <dbl>, and abbreviated variable names
## #   ¹​fixed.acidity, ²​volatile.acidity, ³​citric.acid, ⁴​residual.sugar,
## #   ⁵​chlorides, ⁶​free.sulfur.dioxide, ⁷​total.sulfur.dioxide, ⁸​sulphates
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
rrandfor_acc_test
## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type             
##   <chr>    <chr>          <dbl> <chr>                  
## 1 accuracy multiclass     0.632 Red Random Forest Model
rrandfor_rocauc_test
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.792 White Random Forest Model
rrandfor_roccurve_test

rrandfor_confusionmatrix_test

rpredictedtest <- augment(wboost_fit_final, new_data = white_test) 
rboosted_acc_test <- wpredictedtest %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc_test <- wpredictedtest %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurveTesting <- wpredictedtest %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrixTesting <- wpredictedtest %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")
rpredictedtest
## # A tibble: 481 × 20
##    fixed…¹ volat…² citri…³ resid…⁴ chlor…⁵ free.…⁶ total…⁷ density    pH sulph…⁸
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>
##  1     7.8   0.76     0.04     2.3   0.092      15      54   0.997  3.26    0.65
##  2     5.6   0.615    0        1.6   0.089      16      59   0.994  3.58    0.52
##  3     8.9   0.62     0.19     3.9   0.17       51     148   0.999  3.17    0.93
##  4     8.1   0.56     0.28     1.7   0.368      16      56   0.997  3.11    1.28
##  5     7.9   0.32     0.51     1.8   0.341      17      56   0.997  3.04    1.08
##  6     8.9   0.22     0.48     1.8   0.077      29      60   0.997  3.39    0.53
##  7     7.6   0.39     0.31     2.3   0.082      23      71   0.998  3.52    0.65
##  8     8.5   0.49     0.11     2.3   0.084       9      67   0.997  3.17    0.53
##  9     7.6   0.41     0.24     1.8   0.08        4      11   0.996  3.28    0.59
## 10     6.9   0.685    0        2.5   0.105      22      37   0.997  3.46    0.57
## # … with 471 more rows, 10 more variables: alcohol <dbl>, quality <fct>,
## #   type <chr>, .pred_class <fct>, .pred_3 <dbl>, .pred_4 <dbl>, .pred_5 <dbl>,
## #   .pred_6 <dbl>, .pred_7 <dbl>, .pred_8 <dbl>, and abbreviated variable names
## #   ¹​fixed.acidity, ²​volatile.acidity, ³​citric.acid, ⁴​residual.sugar,
## #   ⁵​chlorides, ⁶​free.sulfur.dioxide, ⁷​total.sulfur.dioxide, ⁸​sulphates
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
rboosted_acc_test
## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass     0.674 White Boosted Trees Model
rboosted_rocauc_test
## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.774 White Boosted Trees Model
rBoostedROCCurveTesting

rBoostedConfusionMatrixTesting

red conclusion

Final thoughts